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ABSTRACT 

Designing a scientific software stack to meet the needs of 
the next-generation of mesh-based simulation demands, not 
only scalable and efficient mesh and data management on 
a wide range of platforms, but also an abstraction layer 
that makes it useful for a wide range of application codes. 
Common utility tasks, such as file I/O, mesh distribution, 
and work partitioning, should be delegated to external li¬ 
braries in order to promote code re-use, extensibility and 
software interoperability. In this paper we demonstrate the 
use of PETSc’s DMPlex data management API to perform 
mesh input and domain partitioning in Fluidity, a large scale 
CFD application. We demonstrate that raising the level of 
abstraction adds new functionality to the application code, 
such as support for additional mesh file formats and mesh re¬ 
ordering, while improving simlutation startup cost through 
more efficient mesh distribution. Moreover, the separation 
of concerns accomplished through this interface shifts criti¬ 
cal performance and interoperability issues, such as scalable 
I/O and file format support, to a widely used and supported 
open source community library, improving the sustainabil¬ 
ity, performance, and functionality of Fluidity. 
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1. INTRODUCTION 

Scalable file I/O and efficient domain topology management 
present important challenges for many scientific applications 
if they are to fully utilise future exascale computing re¬ 
sources. Although these operations are common to many 
scientific codes they have received little attention in opti¬ 
misation efforts, resulting in potentially severe performance 
bottlenecks for realistic simulations that require and gener¬ 
ate large data sets. Moreover, due to a multitude of formats 
and a lack of convergence on standards for mesh and output 
data in the community there is only limited interoperability 
and very little code reuse among scientific applications for 


common operations, such as reading and partitioning input 
meshes. Thus developers are often forced to create custom 
I/O routines or even use application-specific file formats, 
which further limits application portability and interoper¬ 
ability. 

Designing a scientific software stack to meet the needs of 
the next-generation of simulation software technologies de¬ 
mands, not only scalable and efficient algorithms to perform 
data I/O and mesh management at scale, but also an ab¬ 
straction layer that allows a wide variety of application codes 
to utilise them and thus promotes code reuse and interoper¬ 
ability. Such an intermediate representation of mesh topol¬ 
ogy has recently been added to PETSc [3], a widely used 
scientific library for the scalable solution of partial differen¬ 
tial equations, in the form of the DMPlex data management 
API [l4j. 

In this paper we demonstrate the use of PETSc’s DMPlex 
API to perform mesh input and domain topology manage¬ 
ment in Fluidity [T?], a large scale CFD application code 
that already uses the PETSc library as its linear solver en¬ 
gine. By utilising DMPlex as the underlying mesh man¬ 
agement abstraction we not only add support for new mesh 
file formats, such as Exodus II 20, CGNS 18], Gmsh 9 , 
Fluent Case 2 and MED I], to Fluidity, but also enable 
the use of domain decomposition methods, data migration, 
and mesh renumbering techniques at run-time. Moreover, 
the separation of concerns allows PETSc parallel data man¬ 
agement and HDF5 support to be independently optimized 
for the target platform, removing this complexity from the 
application code. Our refactoring of Fluidity provides sig¬ 
nificant performance benefits due to improved cache locality 
and mesh distribution during simulation initialisation, which 
we demonstrate with performance benchmarks performed on 
Archer, a Cray XC30 architecture. 

2. BACKGROUND 

The key challenge in designing software for large scale sys¬ 
tems lies in the composition of abstractions and the defini¬ 
tion of clearly defined yet flexible interfaces between them. 
Code reuse and inter-disciplinary cooperation necessitate 
deeper software stacks and thus configuration and exten¬ 
sibility will play a key role in designing the software stack of 
the future [5]. In this paper we therefore focus on the inter¬ 
action between applications and their supporting libraries 
to provide the infrastructure for efficient data management 
at exascale. 


2.1 Fluidity 

The primary user application in our work is Fluidity, an 
open source unstructured finite element code that uses mesh 
adaptivity to accurately represent a wide range of scales in 
a single numerical simulation without the need for nested 
grids. Fluidity is used in a number of different scientific areas 
including geophysical fluid dynamics, computational fluid 
dynamics, ocean modelling and mantle convection. Fluidity 
implements various finite element and finite volume discreti¬ 
sation methods and is capable of solving solving the Navier- 
Stokes equation and accompanying field equations in one, 
two and three dimensions. 


Previous optimisation efforts have highlighted that file I/O, 
in particular during model initialisation, presents a severe 
performan ce b ottleneck when running on large numbers of 
processes til]. The primary reasons for this are a off-line 
domain partitioning and the need to store each partition 
using a file-per-process strategy. 


2.2 DMPlex 

PETSc’s ability to handle unstructured meshes is centred 
around DMPlex, a data management object that encapsu¬ 
lates the topology of unstructured grids to provide a range of 
functionalities common to many scientific applications. As 
shown in Figure [I] DMPlex stores the connectivity of the 
associated mesh as a layered directed acyclic graph (DAG), 
where each layer (stratum) represents a class of topological 
entities [l4 16 . This flexible yet efficient representation pro¬ 
vides an abstract interface for the implementation of mesh 
management and manipulation algorithms using dimension- 
independent programming. 
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Figure 1: DAG-based representation of a single tetrahedron 
in DMPlex. 


DMPlex stores data by associating data with points in the 
DAG, allowing an arbitrary data size for each point. This 
can be efficiently encoded using the same AIJ data structure 
used for sparse matrices. This scheme is general enough 
to encompass any discrete data layout over a mesh. The 
association with points also means that data can be moved 
using the same set of scalable primitives that are used for 
mesh distribution. 

DMPlex’s internal representation of mesh topology also pro¬ 
vides an abstraction layer that decouples the mesh from the 
underlying file format and thus allows support for multi¬ 
ple mesh file formats to be added generically. At the time 
of writing DMPlex is capable of reading input meshes in 
Exodus II, CGNS, Gmsh, Fluent-Case and MED formats. 
Moreover, DMPlex provides output routines that generate 


solution output in HDF5-based XDMF format, while also 
storing the DMPlex DAG connectivity alongside the visual- 
isable solution data to facilitate checkpointing [3]. 

In addition to a range of I/O capabilities DMPlex also pro¬ 
vides parallel data marshalling through automated paral¬ 
lel distribution of the DMPlex 151 and the pre-allocation 
of parallel matrix and vector data structures. Mesh par¬ 
titioning is provided via internal interfaces to several par- 
titioner libraries (Chaco, Metis/ParMetis) and data migra¬ 
tion is based on PETSc’s internal Star Forest communica¬ 
tion abstraction (PetscSF) [3j. Additionally, DMPlex is de¬ 
signed to provide the connectivity data and grid hierarchies 
required by sophisticated preconditioners, such as geomet¬ 
ric multigrid methods and “Fieldsplit” preconditioning for 
multi-physics problems, to speed up the solution process El 

I- 


2.3 Mesh Reordering 

Mesh reordering techniques represent a powerful performance 
optimisation that can be utilised to increase cache coherency 
of the matrices required during the solution process 110, 12, 
21 . The well-known Reverse Cuthill-McKee (RCMjalgo- 


rithm, which can be used to reduce the bandwidth of CSR 
matrices, is implemented in PETSc allowing DMPlex to 
compute the required permutation of mesh entities directly 
from the domain topology DAG. The resulting permutation 
can then be applied to any discretisation derived from the 
stored mesh topology to improve the cache coherency of the 
associated CSR matrices. 


3. FLUIDITY-DMPLEX INTEGRATION 

Initial mesh input has been a scalability bottleneck in Flu¬ 
idity due to the off-line mesh partitioning step. As illus¬ 
trated in Figure |2a[ the current preprocessor module uses 
Zoltan [8], which use ParMetis 113 for graph partitioning, 
to partition and distribute the initial simulation state to the 
desired number of processes before writing the partitioned 
mesh and data to disk, allowing the main simulation to read 
the pre-part it ioned data in a parallel fashion. 

Fluidity’s parallel mesh initialisation routines, however, rely 
on a file-per-process I/O strategy that require large num¬ 
bers of individual files when running the application at scale. 
This has been shown to put significant pressure on the meta¬ 
data servers in distributed filesystems, such as Lustre or 
PVFS, which ultimately has a detrimental effect on scala¬ 
bility when using sufficiently large numbers of processes 11 . 


3.1 Parallel Simulation Start-up 

One of the objectives of this work, in addition to enhacing 
functionality and usability, is to alleviate Fluidity’s start-up 
bottleneck by utilising DMPlex’s mesh distribution capa¬ 
bilities to perform mesh partitioning at run-time. For this 
purpose, as shown in Figure [2b] a DMPlex topology ob¬ 
ject is created from the initial input mesh and immediately 
partitioned and distributed to all participating processes, al¬ 
lowing Fluidity’s initial coordinate field to be derived from 
the DMPlex object in parallel. From the initial coordinate 
mesh all further discretisations and fields in the simulation 
state are then derived using existing functionality. 






Using DMPlex as an intermediate representation for the un¬ 
derlying mesh topology has the following advantages: 


assumption), a node permutation is required when deriv¬ 
ing Fluidity data structures from the mapping provided by 
DMPlex. 


• Run-time mesh distribution and load balancing removes 
the need to store the partitioned mesh on disk, thus 
removing two costly I/O operations and reducing Flu¬ 
idity’s disk space requirements. 

• Communication volume during startup is reduced, since 
only the topology graph is distributed. This is in con¬ 
trast to the preprocessor, which partitions and dis¬ 
tributes a fully allocated Fluidity state with multiple 
fields. 

• Support for multiple previously unsupported mesh file 
formats is inherited from DMPlex, increasing applica¬ 
tion interoperability. 



(a) Original Fluidity start-up based on off-line pre-processing. 
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(b) DMPlex-based start-up where an initial DMPlex object is 
distributed at run-time. 
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(c) Potential future workflow, where DMPlex performs the ini¬ 
tial mesh read in parallel before a parallel load balancing step. 


Figure 2: Workflow diagram for Fluidity simulation start¬ 
up. 


A key point to note about the DMPlex-based mesh initiali¬ 
sation approach is that by delegating the initial mesh read to 
PETSc any mesh format reader added to DMPlex in the fu¬ 
ture will automatically be inherited by the application code. 
Moreover, future performance optimisations, such as paral- 
lisation of the initial mesh file read, will also be available 
in Fluidity without any further changes to the application. 
Such an envisaged scenario is shown in Figure [2c] where an 
already parallel DMPlex object is created from the initial 
file, followed by a load balancing step before deriving the 
parallel Fluidity state. 

3.2 Mesh Renumbering 

One of the key components of the DMPlex integration is 
the derivation of Fluidity’s initial coordinate mesh object 
from the distributed DMPlex, which includes the derivation 
of the data mapping required for halo exchanges in parallel. 
DMPlex is able to provides such a mapping from local non- 
owned degrees-of-freedom (DoFs) to their respective remote 
owners. However, since Fluidity halo objects require remote 
non-owned DoFs in the solution field to be located contigu¬ 
ously at the end of the solution vector (“trailing receives” 


As a consequence, further node ordering permutations may 
be applied during the derivation of the initial field discreti¬ 
sation, such as the RCM renumbering provided by DMPlex. 
An optional renumbering step can be performed locally after 
the initial mesh distribution and added to the mesh initial¬ 
isation routine with very little programming effort. As a 
result of Fluidity’s run-time derivation of field discretisa¬ 
tions from the underlying coordinate mesh, the RCM data 
layout of the initial reordering is inherited by all fields in the 
simulation, as shown in Figure [ 3 ] Moreover, as new mesh 
renumbering schemes are incorporating into PETSc, they 
will be automatically available to the application code. 
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Figure 3: Effects of RCM reordering on the structure of the 
assembled pressure matrix. 


4. RESULTS 

The following benchmark tests were performed on the UK 
national supercomputer, a Cray XE30 with 4920 nodes con¬ 
nected via an Aries interconnect and a parallel Lustre filesys¬ 
tem Q Each node consists of two 2.7 GHz, 12-core Intel 
E5-2697 v2 (Ivy Bridge) processors with 64GB of memory. 

The benchmark runs simulate the flow past a square cylinder 
for 10 timesteps using a P^ G — P 2 discretisation [t] , where a 
second-order pressure field is solved using Fluidity’s multi¬ 
grid algorithm and paired with a discontinuous first order 
velocity field that uses a GMRES solver with SOR precon¬ 
ditioning. The mesh used has been generated with the Gmsh 
mesh generator [9 and is shown in Figure [4] 

1 http://www.archer.ac.uk/ 













































































































Figure 4: Three-dimensional benchmark mesh used to model 
flow past a cylinder. 


4.1 Mesh Initialisation 

Figure [5] shows a comparison of the simulation start-up cost 
between the DMPlex-based implementation and the original 
preprocessor approach on 4 nodes (96 cores) with increasing 
mesh sizes up to approximately 3 million elements (weak 
scaling). The original start-up cost is hereby quantified as 
the sum of preprocessor and simulation initialisation times. 

A clear improvement in overall start-up performance is shown 
in Figure [5a| although no significant improvement in direct 
file I/O, as shown in Figure [5b] can be determined. This 
is unsurprising, as file I/O is still completely dominated by 
the initial sequential read, although potential gains can be 
expected at larger scales due to removing the intermediate 
reads and writes of the partitioned mesh. 

As highlighted in Figure [5c| the majority of the observed 
overall performance gains can be attributed to significantly 
improved mesh distribution via DMPlex. It is important 
to note here that DMPlex partitions and migrates only the 
mesh topology graph and its associated coordinate values, in 
contrast to the original preprocessor module that distributes 
fully assembled fields. As a result less data needs to be 
communicated during the mesh migration phase, resulting 
in significantly increased start-up performance. 

4.2 Mesh Renumbering 

The overall simulation performance, including the effects of 
the mesh reordering derived from DMPlex, are evaluated 
in Figure [6] This benchmark compares the performance 
of both implementations by running 10 timesteps of the 
full simulation using a mesh with approximately 3 million 
elements on up to 96 cores. The results, shown in Fig¬ 
ure |6a| indicate a consistent performance improvement of 
the DMPlex-based model with native mesh ordering over the 
preprocessor approach that increases with growing numbers 
of processes due to a smaller start-up overhead. 

The effect of RCM mesh reordering is best demonstrated by 
analysing the two most expensive components of the simula¬ 
tion: the pressure field solve (Figure [6b| and the assembly of 
the velocity matrix (Figure [6c|. Both components exhibit 
significant performance increases with RCM reordering on 
small numbers of processors that diminish as the simulation 
approaches the strong scaling limit. However, the benefits 
for overall simulation performance (Figure |6c| with RCM 
reordering decrease between 24 and 96 processes due to the 
fixed-cost start-up overhead of generating the permutation 
outweighing the solver and assembly benefits. 



(a) Overall simulation start-up 



(b) Mesh file I/O 



Figure 5: Comparison of total start-up cost between prepro¬ 
cessor and DMPlex-based approaches. 


5. DISCUSSION 
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(a) Overall simulation performance. 



(b) Pressure solve performance. 



(c) Velocity assembly performance. 

Figure 6: Full simulation performance for 10 timesteps on 
approximately 3 million elements. 


Achieving scalable performance with production-scale scien¬ 
tific applications on future exascale systems requires appro¬ 
priate levels of abstraction across the entire software stack. 
In this paper we report progress on the integration of DM- 
Plex, a library-level domain topology abstraction, with the 
application code Fluidity in order to delegate a set of com¬ 
mon mesh and data management tasks to a widely used li¬ 
brary. We highlight the increased interoperability achieved 
through the inheritance of new mesh file format readers 
and demonstrate improved model initialisation performance 
through run-time mesh distribution routines provided by 
DMPlex. 

The key benefit of the restructured model initialisation work- 
flow, however, lies in the fact that responsibility for support¬ 
ing various mesh file formats and optimising mesh file I/O 
now lies with the underlying library. This entails that any 
future extensions, such as new file formats or parallel mesh 
reader implementations, are automatically inherited by Flu¬ 
idity, as well as other applications using PETSc, such as 
Firedrake ||19] where we have also employed these abstrac¬ 
tions. Moreover, by utilising a centralised mesh management 
API other types of mesh-based performance optimisations 
become available to the application, as highlighted by the 
seamless addition of the RCM renumbering feature. 

5.1 Future Work 

A key contribution of this work lies in the fact that it enables 
future extensions and optimisations. Most crucially perhaps 
is the development of a fully parallel mesh input reader in 
PETSc in order to overcome the remaining sequential bot¬ 
tleneck during model initialisation. This change, however, 
requires a new default mesh format for Fluidity due to the 
inherently sequential nature of the Gmsh file format, which 
again highlights the need for abstraction when optimising 
mesh management. 

In addition to the optimisation of mesh input and model ini¬ 
tialisation, further integration of DMPlex throughout Flu¬ 
idity is desirable to utilise DMPlex’s advanced I/O features, 
such as the HDF5-based Xdmf output format. For this pur¬ 
pose closer integration is required, where additional discreti¬ 
sation data needs to be passed to PETSc to perform all the 
necessary field I/O. Moreover, DMPlex’s mesh and data dis¬ 
tribution utility may also be used to provide load balancing 
after mesh adaptation. 
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